AITopics | quantization technique

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

Neural Information Processing SystemsMar-22-2026, 22:51:37 GMT

A critical approach for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of previously generated tokens, significantly reducing the need for repetitive computations and thereby lowering latency in autoregressive generation. However, the size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation. In this paper, we present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective, significantly reducing the memory footprint for LLM inference. Our approach is based on the observation that KV cache states exhibit high similarity between the adjacent layers in the middle-to-deep portion of LLMs.

artificial intelligence, large language model, natural language, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training

Neural Information Processing SystemsMar-17-2026, 01:04:17 GMT

Data parallelism can boost the training speed of convolutional neural networks (CNN), but could suffer from significant communication costs caused by gradient aggregation. To alleviate this problem, several scalar quantization techniques have been developed to compress the gradients. But these techniques could perform poorly when used together with decentralized aggregation protocols like ring all-reduce (RAR), mainly due to their inability to directly aggregate compressed gradients. In this paper, we empirically demonstrate the strong linear correlations between CNN gradients, and propose a gradient vector quantization technique, named GradiVeQ, to exploit these correlations through principal component analysis (PCA) for substantial gradient dimension reduction. GradiveQ enables direct aggregation of compressed gradients, hence allows us to build a distributed learning system that parallelizes GradiveQ gradient compression and RAR communications. Extensive experiments on popular CNNs demonstrate that applying GradiveQ slashes the wall-clock gradient aggregation time of the original RAR by more than 5x without noticeable accuracy loss, and reduce the end-to-end training time by almost 50%. The results also show that \GradiveQ is compatible with scalar quantization techniques such as QSGD (Quantized SGD), and achieves a much higher speed-up gain under the same compression ratio.

artificial intelligence, machine learning, proceedings, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.60)

Add feedback

Appendix

Neural Information Processing SystemsFeb-8-2026, 13:05:31 GMT

The computation or estimation of the smoothness matrixLi requires addiotional preprocessing.

artificial intelligence, machine learning, quant, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.46)

Add feedback

GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training

Neural Information Processing SystemsNov-20-2025, 23:01:26 GMT

Data parallelism can boost the training speed of convolutional neural networks (CNN), but could suffer from significant communication costs caused by gradient aggregation. To alleviate this problem, several scalar quantization techniques have been developed to compress the gradients. But these techniques could perform poorly when used together with decentralized aggregation protocols like ring all-reduce (RAR), mainly due to their inability to directly aggregate compressed gradients. In this paper, we empirically demonstrate the strong linear correlations between CNN gradients, and propose a gradient vector quantization technique, named GradiVeQ, to exploit these correlations through principal component analysis (PCA) for substantial gradient dimension reduction. GradiveQ enables direct aggregation of compressed gradients, hence allows us to build a distributed learning system that parallelizes GradiveQ gradient compression and RAR communications. Extensive experiments on popular CNNs demonstrate that applying GradiveQ slashes the wall-clock gradient aggregation time of the original RAR by more than 5x without noticeable accuracy loss, and reduce the end-to-end training time by almost 50%. The results also show that \GradiveQ is compatible with scalar quantization techniques such as QSGD (Quantized SGD), and achieves a much higher speed-up gain under the same compression ratio.

bandwidth-efficient gradient aggregation, gradiveq, vector quantization, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.60)

Add feedback

Fibbinary-Based Compression and Quantization for Efficient Neural Radio Receivers

Fiandaca, Roberta, Gomony, Manil Dev

arXiv.org Artificial IntelligenceNov-5-2025

Neural receivers have shown outstanding performance compared to the conventional ones but this comes with a high network complexity leading to a heavy computational cost. This poses significant challenges in their deployment on hardware-constrained devices. To address the issue, this paper explores two optimization strategies: quantization and compression. We introduce both uniform and non-uniform quantization such as the Fibonacci Code word Quantization (FCQ). A novel fine-grained approach to the Incremental Network Quantization (INQ) strategy is then proposed to compensate for the losses introduced by the above mentioned quantization techniques. Additionally, we introduce two novel lossless compression algorithms that effectively reduce the memory size by compressing sequences of Fibonacci quantized parameters characterized by a huge redundancy. The quantization technique provides a saving of 45\% and 44\% in the multiplier's power and area, respectively, and its combination with the compression determines a 63.4\% reduction in memory footprint, while still providing higher performances than a conventional receiver.

artificial intelligence, machine learning, quantization, (14 more...)

arXiv.org Artificial Intelligence

2511.01921

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (0.50)
Media > Radio (0.40)

Technology:

Information Technology > Communications > Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.31)

Add feedback

QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads

Ahmad, Muhammad, Mazher, Khurram, Akram, Saqib, Tameem, Ahmad, Nasir, Saad Bin

arXiv.org Artificial IntelligenceSep-15-2025

We present QuantX: a tailored suite of recipes for LLM and VLM quantization. It is capable of quantizing down to 3-bit resolutions with minimal loss in performance. The quantization strategies in QuantX take into account hardware-specific constraints to achieve efficient dequantization during inference ensuring flexible trade-off between runtime speed, memory requirement and model accuracy. Our results demonstrate that QuantX achieves performance within 6% of the unquantized model for LlaVa-v1.6 quantized down to 3-bits for multiple end user tasks and outperforms recently published state-of-the-art quantization techniques. We further integrate one particular technique from QuantX into the popular Llama.cpp framework and show its feasibility in terms of runtime compared to the mainstream quantization techniques from Llama.cpp. Lastly, this manuscript provides insights into the LLM quantization process that motivated the range of recipes and options that are incorporated in QuantX.

large language model, machine learning, quantization, (18 more...)

arXiv.org Artificial Intelligence

2505.07531

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.40)

Add feedback

401de6a666d7672757bdadfc53c3c123-Supplemental-Conference.pdf

Neural Information Processing SystemsAug-14-2025, 09:22:44 GMT

block quant, iteration transmitted megabyte time, quant, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Reviews: Normalization Helps Training of Quantized LSTM

Neural Information Processing SystemsJun-2-2025, 00:28:00 GMT

While in recent years a number of extreme low-precision quantization techniques were developed for DNNs, they were not directly applicable to recurrent architectures. In recent work [1] an extreme low-precision quantization method was proposed that utilizes batch normalization and compresses recurrent neural networks without a large drop in accuracy achieving state-of-the-art performance. In this paper, the authors proposed a theoretical explanation of the difficulties of training LSTMs with low-precision weights and practically explored a combination of different normalization techniques with different quantization schemes. The authors experimentally showed that simple introduction of weight or layer normalization allows applying many standard quantization techniques without modifications. Comments, suggestions, and questions: Figure 1 is quite difficult to read.

artificial intelligence, deep learning, machine learning, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

Neural Information Processing SystemsMay-27-2025, 21:58:05 GMT

A critical approach for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of previously generated tokens, significantly reducing the need for repetitive computations and thereby lowering latency in autoregressive generation. However, the size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation. In this paper, we present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective, significantly reducing the memory footprint for LLM inference. Our approach is based on the observation that KV cache states exhibit high similarity between the adjacent layers in the middle-to-deep portion of LLMs.

compression ratio, kv cache compression, minicache, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Is (Selective) Round-To-Nearest Quantization All You Need?

Kogan, Alex

arXiv.org Artificial IntelligenceMay-23-2025

Quantization became a necessary tool for serving ever-increasing Large Language Models (LLMs). RTN (Round-to-Nearest) is perhaps the simplest quantization technique that has been around well before LLMs surged to the forefront of machine learning (ML) research. Yet, it has been largely dismissed by recent and more advanced quantization methods that claim superiority over RTN in nearly every aspect of performance. This work aims to dispel this established point of view, showing that RTN is not only much cheaper to apply, but also its token generation throughput can be better than and accuracy can be similar to more advanced alternatives. In particular, we discuss our implementation of RTN based on the recent Marlin kernels and demonstrate how the accuracy of RTN can be gradually improved by selectively increasing the data precision format of certain model layers and modules. Based on our results, we argue that RTN presents a viable and practical choice for quantizing LLMs.

large language model, machine learning, quantization, (21 more...)

arXiv.org Artificial Intelligence

2505.15909

Genre: Research Report > New Finding (0.48)

Technology: